Abstract

We consider an agent interacting with an en- 
vironment in a single stream of actions, ob- 
servations, and rewards, with no reset. This 
process is not assumed to be a Markov De- 
cision Process (MDP). Rather, the agent has 
several representations (mapping histories of 
past interactions to a discrete state space) 
of the environment with unknown dynamics, 
only some of which result in an MDP. The 
goal is to minimize the average regret crite- 
rion against an agent who knows an MDP 
representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order $O(T^{2/3})$ with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after $T$ time steps is $O(\sqrt{T})$, with all constants reasonably small. This is optimal in $T$ since $O(\sqrt{T})$ is the optimal regret in the setting of learning in a (single discrete) MDP.



1. Introduction 

In Reinforcement Learning (RL), an agent has to learn 
a task through interactions with the environment. The 
standard RL framework models the interaction of the 
agent and the environment as a finite-state Markov
decision process (MDP). Unfortunately, the real world
is not (always) a finite-state MDP, and the learner of- 
ten has to find a suitable state-representation model: a 
function that maps histories of actions, observations, 
and rewards provided by the environment into a fi- 
nite space of states, in such a way that the resulting 
process on the state space is Markovian, reducing the 
problem to learning in a finite-state MDP. However, 
finding such a model is highly non-trivial. One can 
come up with several representation models, many of 
which may lead to non-Markovian dynamics. Testing 
which one has the MDP property one by one may be 
very costly or even impossible, as testing a statistical 
hypothesis requires a workable alternative assumption 
on the environment. This poses a challenging prob- 
lem: find a generic algorithm that, given several state- 
representation models only some of which result in an 
MDP, gets (on average) at least as much reward as 
an optimal policy for any of the Markovian represen- 
tations. Here we do not test the MDP property but 
propose to use models as long as they provide high 
enough rewards. 

Motivation. One can think of specific scenarios where the setting of several state-representation models is applicable. First, these models can be discretisations of a continuous state space. Second, they may be discretisations of the parameter space: this scenario has been recently considered (Ortner & Ryabko, 2012) for learning in a continuous-state MDP with Lipschitz continuous rewards and transition probabilities where the Lipschitz constants are unknown; the models are discretisations of the parameter space. A simple example is when the process is a second-order Markov process with discrete observations; in this case a model that maps any history to the last two observations is a Markov model; a detailed illustration of such an example can be found, e.g., in Section 4 of (Hutter, 2009). More generally, one can try and extract some high-level discrete features from (continuous, high-dimensional) observations provided by the environment. For example, the observation is a video input capturing a game board, different maps attempt to extract the (discrete) state of the game, and we assume that at least one map is correct. Some popular classes of models are context trees (McCallum, 1996), which are used to capture short-term memories, or probabilistic deterministic finite automata (Vidal et al., 2005), a very general class of models that can capture both short-term and long-term memories. Since only some of the features may exhibit Markovian dynamics and/or be relevant, we want an algorithm able to exploit whatever is Markovian and relevant for learning. For more details and further examples we refer to (Maillard et al., 2011).



Previous work. This work falls under the framework of providing performance guarantees on the average reward of a considered algorithm. In this setting, the optimal regret of a learning algorithm in a finite-state MDP is $O(\sqrt{T})$. This is the regret of UCRL2 (Jaksch et al., 2010) and REGAL.D (Bartlett & Tewari, 2009). Previous work on this problem in the RL literature includes (Kearns & Singh, 2002; Brafman & Tennenholtz, 2003; Strehl et al., 2006). Moreover, there is currently a big interest in finding practical state representations for the general RL problem where the environment's states and model are both unknown, e.g. U-trees (McCallum, 1996), MC-AIXI-CTW (Veness et al., 2011), $\Phi$MDP (Hutter, 2009), and PSRs (Singh et al., 2004). Another approach in which possible models are known but need not be MDPs was considered in (Ryabko & Hutter, 2008).

For the problem considered in this paper, (Maillard et al., 2011) recently introduced the BLB algorithm that, given a finite set $\Phi$ of state-representation models, achieves regret of order $\sqrt{|\Phi|}\,T^{2/3}$ (where $|\Phi|$ is the number of models) with respect to the optimal policy associated with any model that is Markovian. BLB is based on uniform exploration of all representation models and uses the performance guarantees of UCRL2 to control the amount of time spent on non-Markov models. It also makes use of some internal function in order to guess the MDP diameter (Jaksch et al., 2010) of a Markov model, which leads to an additive term in the regret bound that may be exponential in the true diameter, which means the order $T^{2/3}$ is only valid for possibly very large $T$.

Contribution. We propose a new algorithm called OMS (Optimistic Model Selection), that has regret of order $\sqrt{|\Phi| T}$, thus establishing performance that is optimal in terms of $T$, without suffering from an unfavorable additive term in the bound and without compromising the dependence on $|\Phi|$. This demonstrates that taking into consideration several possibly non-Markovian representation models does not significantly degrade the performance of an algorithm, as compared to knowing in advance which model is the right one. The proposed algorithm is close in spirit to the BLB algorithm. However, instead of uniform exploration it uses the principle of "optimism" for model selection, choosing the model promising the best performance.

Outline. Section 2 introduces the setting; Section 3 presents our algorithm OMS; its performance is analysed in Section 4; proofs are in Section 5, and Section 6 concludes.

2. Setting 

Environment. For each time step $t = 1, 2, \ldots$, let $\mathcal{H}_t := \mathcal{O} \times (\mathcal{A} \times \mathcal{R} \times \mathcal{O})^{t-1}$ be the set of histories up to time $t$, where $\mathcal{O}$ is the set of observations, $\mathcal{A}$ is a finite set of actions and $\mathcal{R} = [0,1]$ is the set of possible rewards. We consider the problem of reinforcement learning when the learner interacts sequentially with some unknown environment: first some initial observation $h_1 = o_1 \in \mathcal{H}_1 = \mathcal{O}$ is provided to the learner, then at any time step $t > 0$, the learner chooses an action $a_t \in \mathcal{A}$ based on the current history $h_t \in \mathcal{H}_t$, then receives the immediate reward $r_t$ and the next observation $o_{t+1}$ from the environment. Thus, $h_{t+1}$ is the concatenation of $h_t$ with $(a_t, r_t, o_{t+1})$.

State representation models. Let $\Phi$ be a set of state-representation models. A state-representation model $\phi \in \Phi$ is a function from the set of histories $\mathcal{H} := \bigcup_{t \ge 1} \mathcal{H}_t$ to a finite set of states $\mathcal{S}_\phi$. For a model $\phi$, the state at step $t$ under $\phi$ is denoted by $s_{t,\phi} := \phi(h_t)$ or simply $s_t$ when $\phi$ is clear from context. For the sake of simplicity, we assume that $\mathcal{S}_\phi \cap \mathcal{S}_{\phi'} = \emptyset$ for $\phi \ne \phi'$. Further, we set $\mathcal{S} := \bigcup_{\phi \in \Phi} \mathcal{S}_\phi$.
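To make the definition concrete, the following is a minimal sketch of two state-representation models written as plain functions on histories. The history encoding, the helper names (`observations`, `make_last_k_model`, `extend`) and the tagging of each state with the model name (one simple way to keep the state spaces of different models disjoint) are illustrative assumptions, not part of the paper.

```python
def observations(history):
    """Extract o_1, ..., o_t from a history [o_1, (a_1, r_1, o_2), ...]."""
    return [history[0]] + [step[2] for step in history[1:]]


def make_last_k_model(k, name):
    """Hypothetical model phi: maps a history to its last k observations.

    The state is tagged with the model name so that two different models
    never share a state, mirroring the disjointness assumption.
    """
    def phi(history):
        obs = observations(history)
        return (name, tuple(obs[-k:]))
    return phi


def extend(history, action, reward, next_obs):
    """h_{t+1} is the concatenation of h_t with (a_t, r_t, o_{t+1})."""
    return history + [(action, reward, next_obs)]


phi1 = make_last_k_model(1, "phi1")  # first-order model
phi2 = make_last_k_model(2, "phi2")  # second-order model

h = ["x"]                      # h_1 = o_1
h = extend(h, "a", 0.5, "y")   # h_2
h = extend(h, "b", 1.0, "x")   # h_3
print(phi1(h), phi2(h))
```

If the environment is a second-order Markov process over observations, only the second model would induce an MDP, although both are valid members of the model set.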

A particular role will be played by state-representation models that induce a Markov decision process (MDP). An MDP is defined as a decision process in which at any discrete time $t$, given action $a_t$, the probability of immediate reward $r_t$ and next observation $o_{t+1}$, given the past history $h_t$, only depends on the current observation $o_t$. That is, $P(o_{t+1}, r_t \mid h_t a_t) = P(o_{t+1}, r_t \mid o_t, a_t)$. Observations in this process are called states of the environment. We say that a state-representation model $\phi$ is a Markov model of the environment, if the process $(s_{t,\phi}, a_t, r_t)$, $t \in \mathbb{N}$, is an MDP. This MDP is denoted as $M(\phi)$. We will always assume that such MDPs are weakly communicating, that is, for each pair of states $x_1, x_2$ there exists $k \in \mathbb{N}$ and a sequence of actions $\alpha_1, \ldots, \alpha_k \in \mathcal{A}$ such that $P(s_{k+1,\phi} = x_2 \mid s_{1,\phi} = x_1, a_1 = \alpha_1, \ldots, a_k = \alpha_k) > 0$. It should be noted that there may be infinitely many state-representation models under which an environment is Markov.

Problem description. Given a finite set $\Phi$ which includes at least one Markov model, we want to construct a strategy that performs as well as the algorithm that knows any Markov model $\phi \in \Phi$, including its rewards and transition probabilities. For that purpose we define for any Markov model $\phi \in \Phi$ the regret of any strategy at time $T$, cf. (Jaksch et al., 2010; Bartlett & Tewari, 2009; Maillard et al., 2011), as

$$\Delta(\phi, T) := T\rho^*(\phi) - \sum_{t=1}^{T} r_t,$$

where $r_t$ are the rewards received when following the proposed strategy and $\rho^*(\phi)$ is the optimal average reward in $\phi$, i.e., $\rho^*(\phi) := \rho(M(\phi), \pi^*_\phi) := \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\big[\sum_{t=1}^{T} r_t(\pi^*_\phi)\big]$ where $r_t(\pi^*_\phi)$ are the rewards received when following the optimal policy $\pi^*_\phi$ for $\phi$. Note that for weakly communicating MDPs the optimal average reward indeed does not depend on the initial state. One could replace $T\rho^*(\phi)$ with the expected sum of rewards obtained in $T$ steps (following the optimal policy) at the price of an additional $O(\sqrt{T})$ term.
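As a small illustration of the regret definition, the sketch below evaluates $T\rho^* - \sum_t r_t$ for a given reward sequence. The function name and the numbers are made up for the example; in practice $\rho^*$ is unknown to the learner and the quantity is only used in the analysis.

```python
def regret(rho_star, rewards):
    """Regret after T = len(rewards) steps against optimal average reward rho_star."""
    T = len(rewards)
    return T * rho_star - sum(rewards)


# Hypothetical run: four steps, optimal average reward 0.75.
collected = [1.0, 0.0, 1.0, 0.5]
print(regret(0.75, collected))
```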

3. Algorithm 

High-level overview. The OMS algorithm we propose (shown in detail as Algorithm 1) proceeds in episodes $k = 1, 2, \ldots$, each consisting of several runs $j = 1, 2, \ldots$. In each run $j$ of some episode $k$, starting at time $t = t_{k,j}$, OMS chooses a policy $\pi_{k,j}$ applying the optimism in face of uncertainty principle twice. First, in line 6, OMS considers for each model $\phi \in \Phi$ a set of admissible MDPs $\mathcal{M}_{t,\phi}$ (defined via confidence intervals for the estimates so far), and computes a so-called optimistic MDP $M_t^+(\phi) \in \mathcal{M}_{t,\phi}$ and an associated optimal policy $\pi_t^+(\phi)$ on $M_t^+(\phi)$ such that the average reward $\rho(M_t^+(\phi), \pi_t^+(\phi))$ is maximized. Then (line 7) OMS chooses the model $\phi_{k,j} \in \Phi$, with policy $\pi_{k,j} := \pi_t^+(\phi_{k,j})$, which maximizes the average reward penalized by a term intuitively accounting for the "complexity" of the model, similar to the REGAL algorithm of (Bartlett & Tewari, 2009).



Algorithm 1 Optimistic Model Selection (OMS)
Require: Set of models $\Phi_0$, parameter $\delta \in [0, 1]$.
 1: Set $t := 1$, $k := 0$, and $\Phi := \Phi_0$.
 2: while true do
 3:   $k := k + 1$, $j := 1$, sameEpisode := true
 4:   while sameEpisode do
 5:     $t_{k,j} := t$
 6:     $\forall \phi \in \Phi$ use EVI to compute optimistic MDP $M_t^+(\phi) \in \mathcal{M}_{t,\phi}$ and (near-)optimal policy $\pi_t^+(\phi)$ with approximate optimistic average reward $\hat\rho_t^+(\phi)$.
 7:     Choose model $\phi_{k,j} \in \Phi$ such that
          $\phi_{k,j} = \arg\max_{\phi \in \Phi} \{ \hat\rho_t^+(\phi) - \mathrm{pen}(\phi; t_{k,j}) \}$.   (1)
 8:     Define $\hat\rho_{k,j} := \hat\rho^+_{t_{k,j}}(\phi_{k,j})$, $\pi_{k,j} := \pi^+_{t_{k,j}}(\phi_{k,j})$.
 9:     sameRun := true.
10:     while sameRun do
11:       Choose action $a_t := \pi_{k,j}(s_t)$, get reward $r_t$, observe next state $s_{t+1} \in \mathcal{S}_{k,j} := \mathcal{S}_{\phi_{k,j}}$.
12:       Set testFail := true iff the sum of the collected rewards so far from time $t_{k,j}$ is less than
            $\ell_{k,j}\hat\rho_{k,j} - \mathrm{lob}_{k,j}(t)$,   (2)
          where $\ell_{k,j} := t - t_{k,j} + 1$.
13:       if testFail then
14:         sameRun := false, sameEpisode := false
15:         $\Phi := \Phi \setminus \{\phi_{k,j}\}$
16:         if $\Phi = \emptyset$ then $\Phi := \Phi_0$ end if
17:       else if $v_k(s_t, a_t) = N_{t_k}(s_t, a_t)$ then
18:         sameRun := false, sameEpisode := false
19:       else if $\ell_{k,j} = 2^j$ then
20:         sameRun := false, $j := j + 1$
21:       end if
22:       $t := t + 1$
23:     end while
24:   end while
25: end while



The policy $\pi_{k,j}$ is then executed until either (i) run $j$ reaches the maximal length of $2^j$ steps (line 19), (ii) episode $k$ terminates when the number of visits in some state has been doubled (line 17), or (iii) the executed policy $\pi_{k,j}$ does not give sufficiently high average reward (line 12). Note that OMS assumes each model to be Markov, as long as it performs well. Otherwise the model is eliminated (line 15).
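The three stopping conditions can be summarised as a single decision made after every step of a run. The sketch below is only an illustration of that control flow: the function name, argument names and returned labels are hypothetical, and the threshold `lob` is passed in as a number rather than computed.

```python
def run_status(reward_sum, ell, rho_hat, lob, visits_in_episode,
               visits_before_episode, j):
    """Decide which stopping condition (if any) fires after the current step.

    reward_sum            -- rewards collected since the run started
    ell                   -- number of steps in the run so far
    rho_hat               -- optimistic average reward of the chosen model
    lob                   -- allowed deviation (treated as given here)
    visits_in_episode     -- visits of the current (state, action) in this episode
    visits_before_episode -- visits of that pair before the episode started
    j                     -- index of the current run within the episode
    """
    if reward_sum < ell * rho_hat - lob:
        return "test_failed"      # model is eliminated, episode ends
    if visits_in_episode == visits_before_episode:
        return "visits_doubled"   # episode ends, model is kept
    if ell == 2 ** j:
        return "run_complete"     # same episode, next run
    return "continue"


print(run_status(3.0, 3, 0.9, 1.0, 2, 5, 2))
```

The order of the checks follows the if / else-if chain of the algorithm, so a failed test takes precedence over the other two conditions.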

Details. We continue with some details of the algorithm. In the following, $S_\phi := |\mathcal{S}_\phi|$ denotes the number of states under model $\phi$, $S := |\mathcal{S}|$ is the total number of states, and $A := |\mathcal{A}|$ is the number of actions. Further, $\delta_t := \delta/(36 t^2)$ is the confidence parameter for time $t$.



Admissible models. First, the set of admissible MDPs $\mathcal{M}_{t,\phi}$ the algorithm considers at time $t$ for each model $\phi \in \Phi$ is defined to contain all MDPs with state space $\mathcal{S}_\phi$ and with rewards $r$ and transition probabilities $p$ satisfying

$$\big\| p(\cdot \mid s,a) - \hat p_t(\cdot \mid s,a) \big\|_1 \le \sqrt{\frac{2 \log\big(2^{S_\phi} S_\phi A t / \delta_t\big)}{N_t(s,a)}}, \tag{3}$$

$$\big| r(s,a) - \hat r_t(s,a) \big| \le \sqrt{\frac{\log\big(2 S_\phi A t / \delta_t\big)}{2 N_t(s,a)}}, \tag{4}$$

where $\hat p_t(\cdot \mid s,a)$ and $\hat r_t(s,a)$ are respectively the empirical transition probabilities and mean rewards (at time $t$) for taking action $a$ at state $s$, and $N_t(s,a)$ is the number of times action $a$ has been chosen in state $s$ up to time $t$. (If $a$ hasn't been chosen in $s$ so far, we set $N_t(s,a)$ to 1.) It can be shown (cf. Appendix C.1 of Jaksch et al. (2010)) that the mean rewards $r$ and the transition probabilities $p$ of a Markovian state-representation $\phi$ satisfy (3) and (4) at time $t$ for all $s \in \mathcal{S}_\phi$ and $a \in \mathcal{A}$, each with probability at least $1 - \delta_t$, making Markov models admissible with high probability.
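The admissibility conditions amount to two confidence radii per state-action pair. The following sketch assumes the radii have the form $\sqrt{2\log(2^{S_\phi} S_\phi A t/\delta_t)/N_t(s,a)}$ for transitions and $\sqrt{\log(2 S_\phi A t/\delta_t)/(2N_t(s,a))}$ for rewards, with $\delta_t = \delta/(36t^2)$; this is how (3) and (4) are read here, and all function names are hypothetical.

```python
import math


def delta_t(delta, t):
    """Confidence parameter for time t (assumed form delta / (36 t^2))."""
    return delta / (36 * t ** 2)


def transition_radius(n_states, n_actions, t, n_visits, delta):
    """Assumed L1 confidence radius for p(.|s,a), cf. (3)."""
    n = max(1, n_visits)  # unvisited pairs are treated as visited once
    log_term = (n_states * math.log(2)
                + math.log(n_states * n_actions * t / delta_t(delta, t)))
    return math.sqrt(2 * log_term / n)


def reward_radius(n_states, n_actions, t, n_visits, delta):
    """Assumed confidence radius for r(s,a), cf. (4)."""
    n = max(1, n_visits)
    log_term = math.log(2 * n_states * n_actions * t / delta_t(delta, t))
    return math.sqrt(log_term / (2 * n))


def is_admissible(p, r, p_hat, r_hat, n_states, n_actions, t, n_visits, delta):
    """Check both conditions for a single state-action pair."""
    l1 = sum(abs(x - y) for x, y in zip(p, p_hat))
    return (l1 <= transition_radius(n_states, n_actions, t, n_visits, delta)
            and abs(r - r_hat) <= reward_radius(n_states, n_actions, t,
                                                n_visits, delta))


print(transition_radius(3, 2, 100, 50, 0.1), reward_radius(3, 2, 100, 50, 0.1))
```

Both radii shrink at rate $1/\sqrt{N_t(s,a)}$, so the admissible set concentrates around the empirical model as a state-action pair is visited more often.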

Extended Value Iteration. For computing a near-optimal policy $\pi_t^+(\phi)$ and a corresponding optimistic MDP $M_t^+(\phi) \in \mathcal{M}_{t,\phi}$ (line 6), OMS applies for each $\phi \in \Phi$ extended value iteration (EVI) (Jaksch et al., 2010) with precision parameter $t^{-1/2}$. EVI computes optimistic approximate state values $\hat u^+_{t,\phi} = (\hat u^+_{t,\phi}(s))_{s \in \mathcal{S}_\phi}$ just like ordinary value iteration (Puterman, 1994) with an additional optimization step for choosing the transition kernel maximizing the average reward. The (approximate) average reward $\hat\rho_t^+(\phi)$ of $\pi_t^+(\phi)$ in $M_t^+(\phi)$ then is given by

$$\hat\rho_t^+(\phi) = \min\Big\{ r_t^+\big(s, \pi_t^+(\phi, s)\big) + \sum_{s'} p_t^+(s' \mid s)\, \hat u^+_{t,\phi}(s') - \hat u^+_{t,\phi}(s) \;:\; s \in \mathcal{S}_\phi \Big\}, \tag{5}$$

where $r_t^+$ and $p_t^+$ are the rewards and transition probabilities of $M_t^+(\phi)$ under $\pi_t^+(\phi)$. It can be shown (Jaksch et al., 2010) that $\hat\rho_t^+(\phi) \ge \rho^*(\phi) - 2/\sqrt{t}$.
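The "additional optimization step" of EVI can be illustrated for a single state-action pair. In the UCRL2 paper of Jaksch et al. this step maximizes $p \cdot u$ over all distributions $p$ within a given L1 distance of the empirical distribution; the sketch below follows that greedy scheme. The function name and the example numbers are illustrative assumptions, and the full value-iteration loop around this step is omitted.

```python
import numpy as np


def optimistic_transition(p_hat, u, radius):
    """Maximize p @ u over distributions p with ||p - p_hat||_1 <= radius.

    Greedy scheme: add up to radius/2 probability mass to the state with the
    largest value, then remove the surplus from the lowest-valued states.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    u = np.asarray(u, dtype=float)
    order = np.argsort(-u)            # states sorted by decreasing value
    p = p_hat.copy()
    best = order[0]
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    for s in order[::-1]:             # start from the worst state
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return p


p_hat = [0.5, 0.3, 0.2]
u = [0.0, 1.0, 2.0]
p_opt = optimistic_transition(p_hat, u, radius=0.4)
print(p_opt, p_opt @ np.asarray(u))
```

In this example 0.2 of probability mass is moved from the lowest-valued state to the highest-valued one, which uses the whole L1 budget of 0.4.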



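Penalization term. At time $t = t_{k,j}$, we define the empirical value span of the optimistic MDP $M_t^+(\phi)$ as $\mathrm{sp}(\hat u^+_{t,\phi}) := \max_{s \in \mathcal{S}_\phi} \hat u^+_{t,\phi}(s) - \min_{s \in \mathcal{S}_\phi} \hat u^+_{t,\phi}(s)$, and the penalization term considered in (1) for each model $\phi$ is given by

$$\mathrm{pen}(\phi; t) := 2^{-j/2}\, c(\phi; t)\, \mathrm{sp}(\hat u^+_{t,\phi}) + 2^{-j/2}\, c'(\phi; t) + 2^{-j}\, \mathrm{sp}(\hat u^+_{t,\phi}),$$

where the constants are given by

$$c(\phi; t) := 2\sqrt{2 S_\phi A \log\big(2^{S_\phi} S_\phi A t / \delta_t\big)} + 2\sqrt{2 \log(1/\delta_t)},$$

$$c'(\phi; t) := 2\sqrt{2 S_\phi A \log\big(2 S_\phi A t / \delta_t\big)}.$$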
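The penalized selection step can be sketched as follows, assuming $\mathrm{pen}(\phi;t) = 2^{-j/2}c\,\mathrm{sp} + 2^{-j/2}c' + 2^{-j}\mathrm{sp}$ with $c = 2\sqrt{2S_\phi A\log(2^{S_\phi}S_\phi A t/\delta_t)} + 2\sqrt{2\log(1/\delta_t)}$, $c' = 2\sqrt{2S_\phi A\log(2S_\phi A t/\delta_t)}$ and $\delta_t = \delta/(36t^2)$. This is the reading of the definitions adopted here; the function names and the candidate values are hypothetical.

```python
import math


def delta_t(delta, t):
    return delta / (36 * t ** 2)


def c(S, A, t, delta):
    d = delta_t(delta, t)
    log_p = S * math.log(2) + math.log(S * A * t / d)  # log(2^S * S * A * t / d)
    return 2 * math.sqrt(2 * S * A * log_p) + 2 * math.sqrt(2 * math.log(1 / d))


def c_prime(S, A, t, delta):
    d = delta_t(delta, t)
    return 2 * math.sqrt(2 * S * A * math.log(2 * S * A * t / d))


def pen(S, A, t, j, span, delta):
    """Assumed penalty of a model with S states and value span `span` in run j."""
    return (2 ** (-j / 2) * (c(S, A, t, delta) * span + c_prime(S, A, t, delta))
            + 2 ** (-j) * span)


def select_model(candidates, A, t, j, delta):
    """candidates: name -> (rho_hat, S, span). Returns the penalized argmax."""
    return max(
        candidates,
        key=lambda m: candidates[m][0]
        - pen(candidates[m][1], A, t, j, candidates[m][2], delta),
    )


models = {"small": (0.8, 2, 1.0), "large": (0.8, 20, 1.0)}
print(select_model(models, A=2, t=100, j=3, delta=0.1))
```

With equal optimistic rewards the smaller model wins, and since the penalty decays with the run index $j$, the optimistic reward dominates the choice in long runs.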



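Deviation from the optimal reward. Let $\ell_{k,j} := t - t_{k,j} + 1$, and $v_{k,j}(s,a)$ be the total number of times $a$ has been played in $s$ during run $j$ in episode $k$ (or until current time $t$ if $j$ is the current run). Similarly, we write $v_k(s,a)$ for the respective total number of visits during episode $k$. (Note that by the assumption $\mathcal{S}_\phi \cap \mathcal{S}_{\phi'} = \emptyset$ for $\phi \ne \phi'$, the state implicitly determines the respective model.) Then for the test (2) that decides whether the chosen model $\phi_{k,j}$ gives sufficiently high reward, we define the allowed deviation from the optimal average reward in the optimistic model for any $t \ge t_{k,j}$ in run $j$ as

$$\mathrm{lob}_{k,j}(t) := 2 \sum_{s \in \mathcal{S}_{k,j}} \sum_{a \in \mathcal{A}} \sqrt{2 v_{k,j}(s,a) \log\Big(\frac{2 S_{k,j} A t_{k,j}}{\delta_{t_{k,j}}}\Big)} + 2\,\mathrm{sp}^+_{k,j} \sum_{s \in \mathcal{S}_{k,j}} \sum_{a \in \mathcal{A}} \sqrt{2 v_{k,j}(s,a) \log\Big(\frac{2^{S_{k,j}} S_{k,j} A t_{k,j}}{\delta_{t_{k,j}}}\Big)} + 2\,\mathrm{sp}^+_{k,j} \sqrt{2 \ell_{k,j} \log(1/\delta_{t_{k,j}})} + \mathrm{sp}^+_{k,j}, \tag{6}$$

where $\mathrm{sp}^+_{k,j} := \mathrm{sp}(\hat u^+_{t_{k,j},\phi_{k,j}})$ and $S_{k,j} := S_{\phi_{k,j}}$. Intuitively, the first two terms correspond to the estimation error of the transition kernel and the rewards, while the last one is due to stochasticity of the sampling process.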
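The test on the collected rewards can be sketched numerically. The code below assumes that the allowed deviation is the sum of a reward-estimation term $2\sum_{s,a}\sqrt{2v\log(2SAt/\delta_t)}$, a transition-estimation term $2\,\mathrm{sp}\sum_{s,a}\sqrt{2v\log(2^S SAt/\delta_t)}$, and a martingale term $2\,\mathrm{sp}\sqrt{2\ell\log(1/\delta_t)} + \mathrm{sp}$, evaluated at the start time of the run with $\delta_t = \delta/(36t^2)$. This decomposition is the reading of (6) used here, and the names are hypothetical.

```python
import math


def lob(visits, S, A, t_start, ell, span, delta):
    """Assumed allowed deviation for a run.

    visits  -- dict (state, action) -> visit count within the run
    S, A    -- number of states of the chosen model, number of actions
    t_start -- time at which the run started
    ell     -- number of steps in the run so far
    span    -- value span of the optimistic model
    """
    d = delta / (36 * t_start ** 2)
    log_r = math.log(2 * S * A * t_start / d)
    log_p = S * math.log(2) + math.log(S * A * t_start / d)
    reward_term = 2 * sum(math.sqrt(2 * v * log_r) for v in visits.values())
    transition_term = 2 * span * sum(math.sqrt(2 * v * log_p)
                                     for v in visits.values())
    martingale_term = 2 * span * math.sqrt(2 * ell * math.log(1 / d)) + span
    return reward_term + transition_term + martingale_term


def passes_test(reward_sum, ell, rho_hat, lob_value):
    """The model passes if the collected reward is at least ell*rho_hat - lob."""
    return reward_sum >= ell * rho_hat - lob_value


visits = {("s0", "a"): 3, ("s1", "b"): 1}
threshold = lob(visits, S=2, A=2, t_start=10, ell=4, span=1.0, delta=0.1)
print(threshold, passes_test(2.0, 4, 0.9, threshold))
```

For short runs the threshold exceeds the run length, so the test is vacuous at first; it only becomes informative once $\ell_{k,j}$ outgrows the square-root terms.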



4. Main result 

We now provide the main result of this paper, an upper bound on the regret of our OMS strategy. The bound involves the diameter of a Markov model $\phi$, $D(\phi)$, which is defined as the expected minimum time required to reach any state starting from any other state in the MDP $M(\phi)$ (Jaksch et al., 2010).

Theorem 1 Let $\phi^*$ be an optimal model, i.e. $\phi^* \in \arg\max\{ \rho^*(\phi) \mid \phi \in \Phi,\ \phi \text{ is Markovian} \}$. Then the regret $\Delta(\phi^*, T)$ of OMS (with parameter $\delta$) w.r.t. $\phi^*$ after any $T \ge SA$ steps is upper bounded by



A\og 



{AS+m)T+{AS+\'P\)\og{^) 
+ {p* + D*) {AS +m)log%§^ 

with probability higher than $1 - \delta$, where $\rho^* := \rho^*(\phi^*)$, $S^* := S_{\phi^*}$, and $D^* := D(\phi^*)$.
In particular, if for all $\phi \in \Phi$, $S_\phi \le B$, then $S \le B|\Phi|$ and hence with high probability

$$\Delta(\phi^*, T) = \tilde{O}\big(D^* A B^{3/2} \sqrt{|\Phi| T}\big).$$



Comparison with the BLB algorithm. Compared to the results obtained by (Maillard et al., 2011) the regret bound in Theorem 1 has improved dependence of $T^{1/2}$ (instead of $T^{2/3}$) with respect to the horizon (up to logarithmic factors). Moreover, the new bound avoids a possibly large constant for guessing the diameter of the MDP representation, as unlike BLB, the current algorithm does not need to know the diameter. These improvements were possible since unlike BLB (which uses uniform exploration over all models, and applies UCRL2 as a "black box") we employ optimistic exploration of the models, and do a more in-depth analysis of the "UCRL2 part" of our algorithm.

On the other hand, we lose in lesser parameters: the multiplicative term in the new bound is $S^* A \sqrt{S} \le S^* A \sqrt{|\Phi| B}$ (assuming that all representations induce a model with no more than $B$ states), whereas the corresponding factor in the bound of (Maillard et al., 2011) is $S^* \sqrt{A |\Phi|}$. Thus, we currently lose a factor $\sqrt{AB}$. Improving on the dependency on the state spaces is an interesting question: one may note that the algorithm actually only chooses models not much more complex (in terms of the diameter and the state space) than the best model. However, it is not easy to quantify this in terms of a concrete bound.

Another interesting question is how to reuse the information gained on one model for evaluation of the others. Indeed, if we are able to propagate information to all models, a $\log(|\Phi|)$ dependency as opposed to the current $\sqrt{|\Phi|}$ seems plausible. However, in the current formulation, a policy can be completely uninformative for the evaluation of other policies in other models. In general, this heavily depends on the internal structure of the models in $\Phi$. If all models induce state spaces that have strictly no point in common, then it seems hard or impossible to improve on $\sqrt{|\Phi|}$.

We also note that it is possible to replace the diameter in Theorem 1 with the span of the optimal bias vector just as for the REGAL algorithm (Bartlett & Tewari, 2009) by suitably modifying the OMS algorithm. However, unlike UCRL2 and OMS for which computation of optimistic model and respective (near-)optimal policy can be performed by EVI, this modified algorithm (as REGAL) relies on finding the solution to a constrained optimization problem, efficient computation of which is still an open problem.



5. Regret analysis of the OMS strategy 

The proof of Theorem 1 is divided into two parts. In Section 5.1 we first show that with high probability all Markovian state-representation models will collect sufficiently high reward according to the test in (2). This also means that the regret of any Markov model is not too large. This in turn is used in Section 5.2 to show that also the optimistic model employed by OMS (which is not necessarily Markov) does not lose too much with respect to an optimal policy in an arbitrary Markov model. In our proof we use analysis similar to (Jaksch et al., 2010) and (Bartlett & Tewari, 2009).

5.1. Markov models pass the test in (2)

Assume that $\phi_{k,j} \in \Phi$ is a Markov model. We are going to show that $\phi_{k,j}$ will pass the test on the collected rewards in (2) of the algorithm at any step $t$ w.h.p.

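Initial decomposition. First note that at time $t$ when the test is performed, we have $\sum_{s \in \mathcal{S}_{k,j}} \sum_{a \in \mathcal{A}} v_{k,j}(s,a) = \ell_{k,j} = t - t_{k,j} + 1$, so that

$$\ell_{k,j} \hat\rho_{k,j} - \sum_{\tau = t_{k,j}}^{t} r_\tau = \sum_{s \in \mathcal{S}_{k,j}} \sum_{a \in \mathcal{A}} v_{k,j}(s,a) \big( \hat\rho_{k,j} - \hat r_{t_{k,j}:t}(s,a) \big), \tag{7}$$

where $\hat r_{t_{k,j}:t}(s,a)$ is the empirical average reward collected for choosing $a$ in $s$ from time $t_{k,j}$ to the current time $t$ in run $j$ of episode $k$. Let $r^+_{k,j}(s,a)$ be the optimistic rewards of the model $M^+_{t_{k,j}}(\phi_{k,j})$ under policy $\pi_{k,j}$ and $P^+_{k,j}$ the respective optimistic transition matrix. Set $\mathbf{v}_{k,j} := \big(v_{k,j}(s, \pi_{k,j}(s))\big)_s \in \mathbb{R}^{S_{k,j}}$, and let $\hat{\mathbf{u}}^+_{k,j} := \big(\hat u^+_{t_{k,j},\phi_{k,j}}(s)\big)_s \in \mathbb{R}^{S_{k,j}}$ be the state value vector given by EVI. By (5) and noting that $v_{k,j}(s,a) = 0$ when $a \ne \pi_{k,j}(s)$ or $s \notin \mathcal{S}_{k,j}$, we get

$$\ell_{k,j} \hat\rho_{k,j} - \sum_{\tau = t_{k,j}}^{t} r_\tau = \sum_{s,a} v_{k,j}(s,a) \big( \hat\rho_{k,j} - r^+_{k,j}(s,a) \big) + \sum_{s,a} v_{k,j}(s,a) \big( r^+_{k,j}(s,a) - \hat r_{t_{k,j}:t}(s,a) \big)$$
$$\le \mathbf{v}_{k,j} \big( P^+_{k,j} - I \big) \hat{\mathbf{u}}^+_{k,j} + \sum_{s,a} v_{k,j}(s,a) \big( r^+_{k,j}(s,a) - \hat r_{t_{k,j}:t}(s,a) \big). \tag{8}$$

We continue bounding each of the two terms on the right hand side of (8) separately.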

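Control of the second term. Writing $r(s,a)$ for the mean reward for choosing $a$ in $s$ (this is well-defined, since we assume the model is Markov), we have

$$r^+_{k,j}(s,a) - \hat r_{t_{k,j}:t}(s,a) = \big( r^+_{k,j}(s,a) - \hat r_{t_{k,j}}(s,a) \big) + \big( \hat r_{t_{k,j}}(s,a) - r(s,a) \big) + \big( r(s,a) - \hat r_{t_{k,j}:t}(s,a) \big).$$

The terms of this decomposition are controlled. That is, using that $M(\phi_{k,j})$ is an admissible model according to (4) with probability $1 - \delta_{t_{k,j}}$ (by applying the results of measure concentration in Appendix C.1 of (Jaksch et al., 2010) to the quantity $\hat r_{t_{k,j}}(s,a)$), and the mere definition of $r^+_{k,j}(s,a)$, and since $N_{t_k}(s,a) \le N_t(s,a)$, we deduce that with probability higher than $1 - \delta_{t_{k,j}}$,

$$\sum_{s,a} v_{k,j}(s,a) \Big( \big( r^+_{k,j}(s,a) - \hat r_{t_{k,j}}(s,a) \big) + \big( \hat r_{t_{k,j}}(s,a) - r(s,a) \big) \Big) \le 2 \sum_{s,a} \frac{v_{k,j}(s,a)}{\sqrt{2 N_{t_{k,j}}(s,a)}} \sqrt{\log\Big(\frac{2 S_{k,j} A t_{k,j}}{\delta_{t_{k,j}}}\Big)} \le \sum_{s,a} \sqrt{2 v_{k,j}(s,a) \log\Big(\frac{2 S_{k,j} A t_{k,j}}{\delta_{t_{k,j}}}\Big)}. \tag{9}$$

On the other hand, using again the results of measure concentration in Appendix C.1 of (Jaksch et al., 2010), and that $v_{k,j}(s,a) \le N_{t_k}(s,a) \le t_{k,j}$, we deduce by a union bound over $S_{k,j} A t_{k,j}$ events that with probability higher than $1 - \delta_{t_{k,j}}$ we get

$$\sum_{s,a} v_{k,j}(s,a) \big( r(s,a) - \hat r_{t_{k,j}:t}(s,a) \big) \le \sum_{s,a} \frac{v_{k,j}(s,a)}{\sqrt{2 v_{k,j}(s,a)}} \sqrt{\log\Big(\frac{2 S_{k,j} A t_{k,j}}{\delta_{t_{k,j}}}\Big)} \le \sum_{s,a} \sqrt{2 v_{k,j}(s,a) \log\Big(\frac{2 S_{k,j} A t_{k,j}}{\delta_{t_{k,j}}}\Big)}. \tag{10}$$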



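Control of the first term. For the first term in (8), let us first notice that, since the rows of $P^+_{k,j}$ sum to 1, $(P^+_{k,j} - I)\hat{\mathbf{u}}^+_{k,j}$ is invariant under a translation of the vector $\hat{\mathbf{u}}^+_{k,j}$. In particular, we can replace $\hat{\mathbf{u}}^+_{k,j}$ with the quantity $\mathbf{h}^+_{k,j}$, where

$$h^+_{k,j}(s) := \hat u^+_{k,j}(s) - \min\{ \hat u^+_{k,j}(s') \mid s' \in \mathcal{S}_{k,j} \}.$$

Then, we make use of the decomposition

$$\mathbf{v}_{k,j} \big( P^+_{k,j} - I \big) \mathbf{h}^+_{k,j} = \mathbf{v}_{k,j} \big( P^+_{k,j} - P_{k,j} \big) \mathbf{h}^+_{k,j} + \mathbf{v}_{k,j} \big( P_{k,j} - I \big) \mathbf{h}^+_{k,j}, \tag{11}$$

where $P_{k,j}$ denotes the transition matrix corresponding to the MDP $M(\phi_{k,j})$ under policy $\pi_{k,j}$. Since both matrices are close to the empirical transition matrix $\hat P_{t_{k,j}}$ at time $t_{k,j}$, we can control the first term of this expression.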

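First part of the first term. Indeed, since $\mathrm{sp}^+_{k,j} = \|\mathbf{h}^+_{k,j}\|_\infty$, we have for the first term in (11), using the decomposition $P^+_{k,j}(\cdot \mid s) - P_{k,j}(\cdot \mid s) = \big(P^+_{k,j}(\cdot \mid s) - \hat P_{t_{k,j}}(\cdot \mid s)\big) + \big(\hat P_{t_{k,j}}(\cdot \mid s) - P_{k,j}(\cdot \mid s)\big)$ together with a concentration result and the definition of $P^+_{k,j}$, that with probability higher than $1 - \delta_{t_{k,j}}$

$$\mathbf{v}_{k,j} \big( P^+_{k,j} - P_{k,j} \big) \mathbf{h}^+_{k,j} \le \sum_{s,a} v_{k,j}(s,a)\, \big\| P^+_{k,j}(\cdot \mid s) - P_{k,j}(\cdot \mid s) \big\|_1 \cdot \big\| \mathbf{h}^+_{k,j} \big\|_\infty \le 2 \sum_{s,a} v_{k,j}(s,a) \sqrt{\frac{2 \log\big(2^{S_{k,j}} S_{k,j} A t_{k,j} / \delta_{t_{k,j}}\big)}{N_{t_{k,j}}(s,a)}}\; \big\| \mathbf{h}^+_{k,j} \big\|_\infty. \tag{12}$$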



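Second part of the first term. The second term of (11) can be rewritten using a martingale difference sequence. That is, let $\mathbf{e}_s \in \mathbb{R}^{S_{k,j}}$ be the unit vector with coordinates 0 for all $s' \ne s$. Following (Jaksch et al., 2010) we set $X_\tau := \big( p(\cdot \mid s_\tau, a_\tau) - \mathbf{e}_{s_{\tau+1}} \big) \mathbf{h}^+_{k,j}$ and get

$$\mathbf{v}_{k,j} \big( P_{k,j} - I \big) \mathbf{h}^+_{k,j} = \sum_{\tau = t_{k,j}}^{t} \big( p(\cdot \mid s_\tau, a_\tau) - \mathbf{e}_{s_\tau} \big) \mathbf{h}^+_{k,j} = \sum_{\tau = t_{k,j}}^{t} X_\tau + \hat u^+_{k,j}(s_{t+1}) - \hat u^+_{k,j}(s_{t_{k,j}}). \tag{13}$$

Now the sequence $(X_\tau)_{t_{k,j} \le \tau \le t}$ is a martingale difference sequence with

$$|X_\tau| \le \big\| p(\cdot \mid s_\tau, a_\tau) - \mathbf{e}_{s_{\tau+1}} \big\|_1\, \mathrm{sp}^+_{k,j} \le 2\,\mathrm{sp}^+_{k,j}.$$

Thus, an application of Azuma-Hoeffding's inequality (cf. Lemma 10 and its application in Jaksch et al. (2010)) to (13) yields

$$\mathbf{v}_{k,j} \big( P_{k,j} - I \big) \mathbf{h}^+_{k,j} \le 2\,\mathrm{sp}^+_{k,j} \sqrt{2 \ell_{k,j} \log(1/\delta_{t_{k,j}})} + \mathrm{sp}^+_{k,j} \tag{14}$$

with probability higher than $1 - \delta_{t_{k,j}}$. Together with (12) this concludes the control of the first term of (8).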

Putting all steps together. Combining (8), (9), (10), (11), (12), and (14), we deduce that at each time $t$ of run $j$ in episode $k$, any Markovian model $\phi_{k,j}$ passes the test in (2) with probability higher than $1 - 4\delta_{t_{k,j}}$. Further, it passes all the tests in run $j$ with probability higher than $1 - 4\delta_{t_{k,j}} 2^j$.

5.2. Regret analysis 

Next, let us consider a model $\phi_{k,j} \in \Phi$, not necessarily Markovian, that has been chosen at time $t_{k,j}$. Let $t + 1$ be the time when one of the three stopping conditions in the algorithm (lines 12, 17, and 19) is met. Thus OMS employs the model $\phi_{k,j}$ between $t_{k,j}$ and $t + 1$, until a new model is chosen after the step $t + 1$. Noting that $r_\tau \in [0, 1]$ and that the total length of the run is $(t + 1) - t_{k,j} + 1 = \ell_{k,j} + 1$ we can bound the regret $\Delta_{k,j}$ of run $j$ in episode $k$ by

$$\Delta_{k,j} := (\ell_{k,j} + 1)\rho^* - \sum_{\tau = t_{k,j}}^{t+1} r_\tau \le \ell_{k,j} (\rho^* - \hat\rho_{k,j}) + \rho^* + \ell_{k,j} \hat\rho_{k,j} - \sum_{\tau = t_{k,j}}^{t} r_\tau.$$

Since by assumption the test in (2) has been passed for all steps $\tau \in [t_{k,j}, t]$, we have

$$\Delta_{k,j} \le \ell_{k,j} (\rho^* - \hat\rho_{k,j}) + \rho^* + \mathrm{lob}_{k,j}(t), \tag{15}$$

and we continue bounding the terms of $\mathrm{lob}_{k,j}(t)$.

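Stopping criterion based on the visit counter. Since $\sum_{s,a} v_{k,j}(s,a) = \ell_{k,j} \le 2^j$, by Cauchy-Schwarz inequality $\sum_{s,a} \sqrt{v_{k,j}(s,a)} \le 2^{j/2} \sqrt{S_{k,j} A}$. Plugging this into the definition (6) of $\mathrm{lob}_{k,j}$, we deduce from (15) that

$$\Delta_{k,j} \le \ell_{k,j} (\rho^* - \hat\rho_{k,j}) + \rho^* + \mathrm{sp}^+_{k,j} + 2^{j/2}\, \mathrm{sp}^+_{k,j}\, c(\phi_{k,j}; t_{k,j}) + 2^{j/2} c'(\phi_{k,j}; t_{k,j}). \tag{16}$$

Selection procedure with penalization. Now, by definition of the algorithm, for any optimal Markov model $\phi^*$ defined in the statement of Theorem 1, whenever $M(\phi^*)$ is admissible, i.e. $M(\phi^*) \in \mathcal{M}_{t_{k,j},\phi^*}$, and was not eliminated during all runs before run $j$ in episode $k$, we have $\hat\rho_{k,j} - \mathrm{pen}(\phi_{k,j}; t_{k,j}) \ge \hat\rho^+_{t_{k,j}}(\phi^*) - \mathrm{pen}(\phi^*; t_{k,j}) \ge \rho^* - \mathrm{pen}(\phi^*; t_{k,j}) - 2 t_{k,j}^{-1/2}$, or equivalently

$$\rho^* - \hat\rho_{k,j} \le \mathrm{pen}(\phi^*; t_{k,j}) - \mathrm{pen}(\phi_{k,j}; t_{k,j}) + 2 t_{k,j}^{-1/2}$$
$$= 2^{-j/2} c(\phi^*; t_{k,j})\, \mathrm{sp}(\hat u^+_{t_{k,j},\phi^*}) + 2^{-j/2} c'(\phi^*; t_{k,j}) + 2^{-j}\, \mathrm{sp}(\hat u^+_{t_{k,j},\phi^*})$$
$$\quad - 2^{-j/2} c(\phi_{k,j}; t_{k,j})\, \mathrm{sp}^+_{k,j} - 2^{-j/2} c'(\phi_{k,j}; t_{k,j}) - 2^{-j}\, \mathrm{sp}^+_{k,j} + 2 t_{k,j}^{-1/2}. \tag{17}$$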

Noting that $\ell_{k,j} \le 2^j$ and recalling that when $M(\phi^*)$ is admissible, the span of the corresponding optimistic model is less than the diameter of the true model, i.e. $\mathrm{sp}(\hat u^+_{t_{k,j},\phi^*}) \le D^*$, see (Jaksch et al., 2010), we obtain from (16), (17), and a union bound that

$$\Delta_{k,j} \le \rho^* + D^* + 2^{j/2} \big( D^* c(\phi^*; t_{k,j}) + c'(\phi^*; t_{k,j}) \big) + 2^{j+1} t_{k,j}^{-1/2} \tag{18}$$

with probability higher than

$$1 - \sum_{(k',j')\,:\, t_{k',j'} < t_{k,j}} 4 \delta_{t_{k',j'}} 2^{j'} - 2\delta_{t_{k,j}}. \tag{19}$$

The sum in (19) comes from the event that $\phi^*$ passes all tests (and is admissible) for all runs in all episodes previous to time $t_{k,j}$, and $2\delta_{t_{k,j}}$ comes from the event that $\phi^*$ is admissible at time $t_{k,j}$. We conclude in the following by summing $\Delta_{k,j}$ over all runs and episodes.

Summing over runs and episodes. Let $J_k$ be the total number of runs in episode $k$, and let $K_T$ be the total number of episodes up to time $T$. Noting that $c(\phi^*; t_{k,j}) \le c(\phi^*; T)$ and $c'(\phi^*; t_{k,j}) \le c'(\phi^*; T)$ as well as using that $2 t_{k,j} \ge 2^j$ (so that $2^{j+1} t_{k,j}^{-1/2} \le 2\sqrt{2} \cdot 2^{j/2}$), summing (18) over all runs and episodes gives

$$\sum_{k=1}^{K_T} \sum_{j=1}^{J_k} \Delta_{k,j} \le (\rho^* + D^*) \sum_{k=1}^{K_T} J_k + \big( D^* c(\phi^*; T) + c'(\phi^*; T) + 2\sqrt{2} \big) \sum_{k=1}^{K_T} \sum_{j=1}^{J_k} 2^{j/2}, \tag{20}$$

with probability higher than $1 - \sum_{k=1}^{K_T} \sum_{j=1}^{J_k} 4\delta_{t_{k,j}} 2^j$, where we used a union bound over all events considered in (19) for the control of all the $\Delta_{k,j}$ terms, avoiding redundant counts (such as the admissibility of $\phi^*$ at time $t_{k,j}$). Now, using the definition of $\delta_{t_{k,j}}$ and the fact that $2 t_{k,j} \ge 2^j$, we get that

$$4\delta_{t_{k,j}} 2^j \le \frac{2^j \delta}{2 t_{k,j} (t_{k,j} + 2^j)} = \frac{\delta}{2 t_{k,j}} - \frac{\delta}{2 (t_{k,j} + 2^j)} \le \sum_{t = t_{k,j}}^{t_{k,j} + 2^j - 1} \frac{\delta}{2 t^2},$$

where the last inequality follows by a series-integral comparison, using that $t \mapsto t^{-2}$ is a decreasing function. Thus, we deduce that the bound (20) is valid with probability at least $1 - \sum_{t=1}^{\infty} \frac{\delta}{2 t^2} \ge 1 - \delta$, and it remains to bound the double sum $\sum_{k} \sum_{j} 2^{j/2}$.
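The probability bookkeeping above can be checked numerically. The sketch below assumes $\delta_t = \delta/(36t^2)$ and verifies, over a grid of run indices $j$ and start times $t$ with $2t \ge 2^j$, that the failure mass $4\delta_t 2^j$ of a run is dominated by $\sum_{\tau=t}^{t+2^j-1} \delta/(2\tau^2)$, and that the total $\sum_{t \ge 1} \delta/(2t^2)$ stays below $\delta$. The helper names and the grid are arbitrary choices for the illustration.

```python
def delta_t(delta, t):
    """Confidence parameter for time t (assumed form delta / (36 t^2))."""
    return delta / (36 * t ** 2)


def run_failure_mass(delta, t_start, j):
    """Failure probability allotted to run j started at time t_start."""
    return 4 * delta_t(delta, t_start) * 2 ** j


def series_mass(delta, t_start, j):
    """Sum of delta / (2 t^2) over the 2^j time steps of the run."""
    return sum(delta / (2 * t ** 2) for t in range(t_start, t_start + 2 ** j))


delta = 0.1
checks = [
    run_failure_mass(delta, t, j) <= series_mass(delta, t, j)
    for j in range(1, 8)
    for t in range(2 ** (j - 1), 200)   # enforces 2 t >= 2^j
]
total = sum(delta / (2 * t ** 2) for t in range(1, 100001))
print(all(checks), total)
```

Since distinct runs cover disjoint time intervals, the per-run sums never overlap, which is why the total failure probability is bounded by a single convergent series.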

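From the number of runs... First note that by definition of the total number of episodes $K_T$ we must have

$$\sum_{k=1}^{K_T} \sum_{j=1}^{J_k - 1} 2^j = \sum_{k=1}^{K_T} \big( 2^{J_k} - 2 \big) \le T, \tag{21}$$

which implies also that we have the bound

$$\sum_{k=1}^{K_T} \sum_{j=1}^{J_k} 2^j \le 2 \sum_{k=1}^{K_T} \big( 2^{J_k} - 2 \big) + 2 K_T \le 2T + 2K_T.$$

Further, by Jensen's inequality we get

$$\sum_{k=1}^{K_T} \sum_{j=1}^{J_k} 2^{j/2} \le \sqrt{\sum_{k=1}^{K_T} J_k}\; \sqrt{2T + 2K_T}. \tag{22}$$

Now, to bound the total number of runs $\sum_{k=1}^{K_T} J_k$, using Jensen's inequality and (21), we deduce

$$\sum_{k=1}^{K_T} J_k = \sum_{k=1}^{K_T} \log_2\big(2^{J_k}\big) \le K_T \log_2\Big( \frac{1}{K_T} \sum_{k=1}^{K_T} 2^{J_k} \Big) \le K_T \log_2\Big( \frac{T}{K_T} + 2 \Big) \le K_T \log_2\Big( \frac{3T}{K_T} \Big), \tag{23}$$

where the last step uses $K_T \le T$, and thus it remains to deal with $K_T$.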

... to the number of episodes. First recall that an episode is terminated when either the number of visits in some state-action pair $(s,a)$ has been doubled (line 17 of the algorithm) or when the test on the accumulated rewards has failed (line 12). We know that with probability at least $1 - \delta$ the optimal Markov model is not eliminated from $\Phi$, while non-Markov models failing the test are deleted from $\Phi$. Therefore, with probability $1 - \delta$ the number of episodes terminated with a model failing the test is upper bounded by $|\Phi| - 1$.

Next, let us consider the number of episodes which are ended since the number of visits in some state-action pair $(s,a)$ has been doubled. Let $K(s,a)$ be the number of episodes which ended after the number of visits in $(s,a)$ has been doubled, and let $T(s,a)$ be the number of steps in these episodes. As it may happen that in an episode the number of visits is doubled in more than one state-action pair, we assume that $K(s,a)$ and $T(s,a)$ count only the episodes/steps where $(s,a)$ is the first state-action pair for which this happens. It is easy to see that $K(s,a) \le 1 + \log_2 T(s,a) = \log_2 2T(s,a)$ for $T(s,a) > 0$. Then the bound $\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \log_2 2T(s,a)$ on the total number of these episodes is maximal under the constraint $\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} T(s,a) = T$ when $T(s,a) = \frac{T}{SA}$ for all $(s,a)$. This shows that the total number of episodes $K_T$ is upper bounded by

$$K_T \le SA \log_2\Big( \frac{2T}{SA} \Big) + |\Phi| - 1 \tag{24}$$

with probability $1 - \delta$, provided that $T \ge SA$.
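The doubling argument can be checked with a small simulation for a single state-action pair. The sketch below counts how often the visit counter doubles when the pair is visited a given number of times (treating an unvisited pair as having one prior visit, as in the definition of $N_t$), and compares the count with $\log_2(2T(s,a))$. The function names are hypothetical; `episode_bound` simply evaluates $SA\log_2(2T/(SA)) + |\Phi| - 1$.

```python
import math


def doubling_episodes(total_visits):
    """Number of episodes ended by the doubling rule for one (s, a) pair."""
    n_before = 0   # visits before the current episode
    v = 0          # visits within the current episode
    episodes = 0
    for _ in range(total_visits):
        v += 1
        if v == max(1, n_before):   # visit count has doubled
            episodes += 1
            n_before += v
            v = 0
    return episodes


def episode_bound(T, S, A, n_models):
    """Evaluate S*A*log2(2T/(S*A)) + n_models - 1, valid for T >= S*A."""
    return S * A * math.log2(2 * T / (S * A)) + n_models - 1


print(doubling_episodes(100), math.log2(2 * 100), episode_bound(1000, 5, 2, 3))
```

The simulated count equals $1 + \lfloor \log_2 T(s,a) \rfloor$, so the inequality $K(s,a) \le \log_2 2T(s,a)$ is tight exactly when $T(s,a)$ is a power of two.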

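Putting all steps together. Combining (20), (22), and (23) we get

$$\Delta(\phi^*, T) \le (\rho^* + D^*)\, K_T \log_2\Big( \frac{3T}{K_T} \Big) + \big( D^* c(\phi^*; T) + c'(\phi^*; T) + 2\sqrt{2} \big) \sqrt{2 K_T \log_2\Big( \frac{3T}{K_T} \Big) (T + K_T)}.$$

Hence, by (24) and the definition of $c$, $c'$, the regret of OMS is, with probability higher than $1 - \delta$, bounded by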

A(0^T) ^ (/ + i?*)(5^+|$|)log^(|5) 



21)* V 25'* A log 



,.12" 245* AT-' 



2D*j2\og{^) 



2V2 



+ 2^25*Alog(48^ 
X log2 (15) (y'(5A+|$|)2T+ {SA +1$!) log2 ( 



and we may conclude the proof with some minor sim- 
plifications. 

6. Outlook 

The first natural question about the performance guarantees obtained is whether they are optimal. We know from the corresponding lower bounds for learning MDPs (Jaksch et al., 2010) that the dependence on $T$ we get for OMS is indeed optimal. Among other parameters, perhaps the most important one is the number of models $|\Phi|$; here we conjecture that the $\sqrt{|\Phi|}$ dependence we obtain is optimal, but this remains to be proven. Other parameters are the size of the action and state spaces for each model; here we lose with respect to the precursor BLB algorithm (see the remark after Theorem 1), and thus have room for improvement. It may be possible to obtain a better dependence for OMS at the expense of more sophisticated analysis. Note, however, that so far there are no known algorithms for learning even a single MDP that would have known optimal dependence on all these parameters.

Another important direction for future research is infinite sets $\Phi$ of models; perhaps countably infinite sets are the natural first step, with separable (in a suitable sense) continuously-parametrized general classes of models being a foreseeable extension. A problem with the latter formulation is that one would need to formalize the notion of a model being close to a Markovian model and quantify the resulting regret.